March 18, 2025
\[ \newcommand\hbb{{\hat{\boldsymbol \beta}}} \newcommand\bb{{\boldsymbol \beta}} \newcommand\expn{{\frac{1}{N} \sum \limits_{i = 1}^N}} \newcommand\sumk{\sum \limits_{k = 1}^K} \newcommand\argminb{\underset{\bb}{\text{argmin }}} \newcommand\argmaxb{\underset{\bb}{\text{argmax }}} \newcommand\gtheta{\mathbf g(\boldsymbol \theta)} \newcommand\htheta{\mathbf H(\boldsymbol \theta)} \]
Recurrent Neural Networks (RNNs) are the backbone of sequence modeling
The inputs or outputs have a meaningful temporal ordering
Knowing what came before (or what comes after) should inform what we predict at the current point in time
Adds knowledge to the NN architecture
Units need to talk backwards in time - all of the previous values need to influence what I guess next!
Note, we’re going to use time pretty loosely here - just some sort of meaningful temporal flow
If the order of the input matters, then we should preserve that information!
We can process a sequence of vectors, \(\mathbf x\), by applying a recurrence formula at every time step:
\[ h_t = f_w(h_{t - 1}, x_t) \]
\(h_{t - 1}\) is the old state (some vector of numbers)
\(x_t\) is the input state at time \(t\) (some vector)
\(f_w()\) is a function of some parameters, \(\mathbf W\) (some collection of weight matrices)
\(h_t\) is the new state (some vector)
The hidden state vector contains latent information about the current state of the model
For language models, it could tell us about the gender of pronouns in the sentence
Or the passiveness of the tone
This information should persist throughout the sequence of predictions!
The vanilla update:
\[ \underset{(m \times 1)}{\mathbf h_t} = \text{tanh}(\underset{(m \times m)}{\mathbf W_{hh}} \underset{(m \times 1)}{\mathbf h_{t-1}} + \underset{(m \times P)}{\mathbf W_{xh}} \underset{(P \times 1)}{\mathbf x_t} + \underset{(m \times 1)}{\mathbf b_h}) \]
\(\mathbf h_t\) and \(\mathbf h_{t-1}\) are hidden vectors of size \(m\)
\(\mathbf W_{hh}\) is an \(m \times m\) matrix of weights that controls the mapping of one hidden state vector to the next
\(\mathbf W_{xh}\) is an \(m \times P\) matrix of weights that controls the mapping of the input to the hidden state
\(\mathbf b_h\) is a vector of bias terms
\(\text{tanh}()\) is an activation function
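The vanilla update above can be sketched in a few lines of NumPy. This is a minimal illustration, not a full implementation: the sizes \(m\) and \(P\) are toy values, and the random parameter values stand in for weights that would actually be learned.

```python
import numpy as np

rng = np.random.default_rng(0)
m, P = 4, 3  # hidden size m, input size P (toy values)

# Stand-ins for learned parameters.
W_hh = rng.normal(size=(m, m)) * 0.1  # hidden-to-hidden weights
W_xh = rng.normal(size=(m, P)) * 0.1  # input-to-hidden weights
b_h = np.zeros(m)                     # bias terms

def rnn_step(h_prev, x_t):
    """One vanilla RNN update: h_t = tanh(W_hh h_{t-1} + W_xh x_t + b_h)."""
    return np.tanh(W_hh @ h_prev + W_xh @ x_t + b_h)

# Roll the recurrence over a short sequence of input vectors.
h = np.zeros(m)  # initial hidden state
for x_t in rng.normal(size=(5, P)):
    h = rnn_step(h, x_t)
print(h.shape)  # (4,)
```

Note how every input passes through the same weights: the only thing carrying history forward is the hidden vector \(\mathbf h_t\) itself.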
The RNN is said to have infinite memory
What it saw before will always play a part (albeit with diminishing influence) in what it predicts next!
Recurrence!
Today, we’re going to look at the most popular RNN framework - Seq2Seq
Any sequence of inputs to any sequence of outputs
Machine translation (French to English)
Character Addition
General question answering!
Seq2Seq is a generic approach to the sequence-to-sequence prediction problem!
For each output, we look at:
The previous decoder state
The previous output
The final encoder state
Problem: The input sequence is bottlenecked through the final encoder state
Say, a 512-dimensional vector that has to describe all of the language we previously saw
What if the length of the input was 1000?
Consider the aligned case - we \(\rightarrow\) estamos
For large language problems, seq2seq would need to summarize everything it has seen in the input in terms of a single vector of length 512
Large for small problems, but quite small for the entirety of the English language
Try to imagine all of the books you’ve ever read.
Imagine all of that literature compressed to a single vector of 512 numbers
Not really a good strategy…
Attention is a mechanism that allows the decoder to leverage all parts of the original inputs
At different steps of the decoder, let the model “focus” on different parts of the input!
Input: “The agreement on the European Economic Area was signed in August 1992”
Output: “L’accord sur la zone économique européenne a été signé en août 1992.”
The decoder doesn’t use the fact that the encoder hidden states form an ordered sequence!
Order is baked into the hidden states via the encoder
How do we compute the attention weights?
\[ e_{ti} = f(s_{t-1}, h_i) \]
\[ a_{ti} = \sigma(e_{ti}) = \frac{\exp(e_{ti})}{\sum_{j} \exp(e_{tj})} \]
where \(\sigma\) denotes the softmax over the encoder positions \(i\)
The attention scores should measure the level of similarity between the current context of the decoder and the original input states
Give more weight if the current context should leverage a specific part of the input
At time \(t\) of the decoder, create a weight for each state of the encoder
A quick note:
Without loss of generality, we can assume that the hidden state vectors of the encoder \(\mathbf h_i\) and the decoder \(\mathbf s_t\) are normalized to have norm 1
These are just vectors, so we can always force them to have a specific size
No change in direction, just change in norm
Additionally, let's just assume that the lengths of each \(\mathbf h_i\) and \(\mathbf s_t\) are the same!
Goal: Find the key that is most similar to the query
Method 1: Vanilla MLP (Bahdanau attention)
\[ e_{ti} = \mathbf w_2^T \text{tanh}\left(\mathbf W_1 [\mathbf h_i , \mathbf s_{t - 1}] \right) \]
Find similarity as a prediction problem and use a 1 layer NN to get the values
Concatenate the encoder and decoder hidden states and learn a weight mapping
Softmax over all hidden states to get weights
Requires learning a weight matrix \(\mathbf W_1\) and a weight vector \(\mathbf w_2\)
Generalizes to case where the encoder and decoder hidden states are of different lengths
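A minimal NumPy sketch of Bahdanau attention, assuming a toy hidden size \(m = 4\) and \(T_e = 6\) encoder states; the random values for \(\mathbf W_1\) and \(\mathbf w_2\) stand in for learned parameters:

```python
import numpy as np

rng = np.random.default_rng(1)
m, T_e = 4, 6  # hidden size, number of encoder states

H = rng.normal(size=(T_e, m))   # encoder hidden states h_i (one per row)
s_prev = rng.normal(size=m)     # previous decoder state s_{t-1}

# Stand-ins for learned parameters: W_1 maps the concatenated pair, w_2 scores it.
W1 = rng.normal(size=(m, 2 * m)) * 0.1
w2 = rng.normal(size=m) * 0.1

def softmax(z):
    z = z - z.max()  # stabilize before exponentiating
    p = np.exp(z)
    return p / p.sum()

# Score each encoder state against the decoder context via a 1-layer NN...
e = np.array([w2 @ np.tanh(W1 @ np.concatenate([h_i, s_prev])) for h_i in H])
a = softmax(e)  # ...then softmax the scores into attention weights a_{ti}
c = a @ H       # context vector: attention-weighted sum of encoder states
```

Because the concatenation handles the two vectors separately, nothing here requires \(\mathbf h_i\) and \(\mathbf s_{t-1}\) to have the same length, which is the generalization noted above.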
Goal: Find the key that is most similar to the query
\[ \mathbf h_i \in \mathbb R^m \text{ ; } \mathbf s_{t-1} \in \mathbb R^m \]
Method 2: Learnable Scaled Dot Product
\[ e_{ti} = \frac{\mathbf h_i^T \mathbf W_g \mathbf s_{t-1}}{\sqrt{m}} \]
Find the inner product of the two vectors (after a learned linear map) and divide by \(\sqrt{m}\) to keep the scores at a stable scale
Learn a \(m \times m\) matrix that controls the mapping!
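The same toy setup as before makes the scaled dot product easy to see in NumPy; here the whole score vector comes out of one matrix product, with a randomly initialized \(\mathbf W_g\) standing in for the learned map:

```python
import numpy as np

rng = np.random.default_rng(2)
m, T_e = 4, 6  # hidden size, number of encoder states

H = rng.normal(size=(T_e, m))        # encoder states h_i (one per row)
s_prev = rng.normal(size=m)          # decoder state s_{t-1}
W_g = rng.normal(size=(m, m)) * 0.1  # learned bilinear map (random stand-in)

# e_{ti} = h_i^T W_g s_{t-1} / sqrt(m), computed for all i at once.
e = (H @ W_g @ s_prev) / np.sqrt(m)
a = np.exp(e - e.max()); a /= a.sum()  # softmax over encoder positions
c = a @ H                              # context vector
```

Compared with the MLP score, this is just one learned matrix and a single batched matrix product, which is part of why the dot-product form scales so well.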
Putting this all together, we get a model for each output of the decoder!
Place weights where needed to map from one set of attention weighted encoder to a decoder output
Note that the decoder output is a weighted combination of the previous decoder state and the attention weighted encoder state
A recurrent weighted scheme for producing contextual seq2seq models
Fully differentiable, so not too much to cover in terms of fitting
Just let autodiff handle it!
\[ \mathbf y_{t} = g(\mathbf s_t) \text{ ; } \mathbf s_t = f(\mathbf s_{t-1}, \mathbf y_{t-1}, \mathbf c_t) \]
\[ \mathbf c_t = \sum \limits_{i = 1}^{T_e} a_{ti} \mathbf h_i \text{ ; } a_{ti} = \sigma(e_{ti}) \]
\[ e_{ti} = \frac{\mathbf h_i^T \mathbf W_g \mathbf s_{t-1}}{\sqrt{m}} \]
where \(\mathbf h_i\) is learned via a RNN setup!
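One full decoder step can be sketched as follows. The attention part matches the equations above; the state and output updates use one plausible parameterization (a tanh combination and a softmax projection), since the notes leave the exact forms of \(f\) and \(g\) open. All sizes and weights are toy stand-ins.

```python
import numpy as np

rng = np.random.default_rng(4)
m, T_e, V = 4, 6, 10  # hidden size, encoder length, vocab size (toy)

H = rng.normal(size=(T_e, m))        # encoder states h_i (from the RNN encoder)
W_g = rng.normal(size=(m, m)) * 0.1  # attention map
# Hypothetical parameterization of the state update and output map:
W_s = rng.normal(size=(m, m)) * 0.1  # weights on the previous state
W_y = rng.normal(size=(m, m)) * 0.1  # weights on the previous output embedding
W_c = rng.normal(size=(m, m)) * 0.1  # weights on the context vector
W_o = rng.normal(size=(V, m)) * 0.1  # output projection to vocab scores

def softmax(z):
    z = z - z.max()
    p = np.exp(z)
    return p / p.sum()

def decoder_step(s_prev, y_prev):
    e = (H @ W_g @ s_prev) / np.sqrt(m)  # attention scores e_{ti}
    a = softmax(e)                       # attention weights a_{ti}
    c = a @ H                            # context vector c_t
    s = np.tanh(W_s @ s_prev + W_y @ y_prev + W_c @ c)  # new state s_t
    y = softmax(W_o @ s)                 # output distribution y_t
    return s, y

s_prev = np.zeros(m)         # previous decoder state s_{t-1}
y_prev = rng.normal(size=m)  # embedding of the previous output y_{t-1}
s_t, y_t = decoder_step(s_prev, y_prev)
```

Every operation in the step is differentiable, so in practice autodiff really does handle all of the fitting.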
RNNs are sequential models by nature
Do we even need to do it that way?
Suppose that we have an input sentence:
I arrived at the bank after crossing the street.
At \(t = 8\) of the encoder, we see:
I arrived at the bank after crossing the []
Two completely viable sentences:
Street
River
There is no hope for a sequential model to learn what should come next!
What we need is the ability to look ahead when we encode the input
Attention Layers will require us to change our terminology a little
Query: vector from which the attention is looking. In the seq2seq model, this is the decoder hidden state \(\mathbf s_{t-1}\)
Key: vector at which the query looks to compute weights. \(\mathbf h\) - all the hidden states in the encoder
Value: the vector to be weighted by the similarity between the query and the key - also the hidden states in the encoder
What if we didn’t need to even encode ordering?
Treat everything as a QKV problem
Given a query, compare it to all keys
Compute an attention weight
Use that to get the context and then make a prediction
Broader attention definition than seq2seq
Inputs:
Queries: \(\mathbf Q\) - an \(N_Q \times m\) matrix ( \(N_Q\) is the number of hidden states we need in the decoder).
Input vectors: \(\mathbf X\) - an \(N_X \times D_X\) matrix of input values where each row is an input vector of length \(D_X\) (think a word embedding for each word in a sentence)
Key weights: \(\mathbf W_K\) - a \(D_X \times m\) matrix of weights (the keys must live in the same \(m\)-dimensional space as the queries)
Value weights: \(\mathbf W_V\) - a \(D_X \times D_V\) matrix where \(D_V\) is the length of the output vectors (again, think about the output vectors as embedded versions of words)
Computations:
Key vectors: \(\mathbf K = \mathbf X \mathbf W_K\) a \((N_X \times m)\) matrix
Value vectors: \(\mathbf V = \mathbf X \mathbf W_V\) a \((N_X \times D_V)\) matrix
Similarities: \(\mathbf E = \frac{\mathbf Q \mathbf K^T}{\sqrt{m}}\) a \((N_Q \times N_X)\) matrix
Attention weights: \(\mathbf A = \sigma (\mathbf E)\) - normalized over the rows.
Output vectors: \(\mathbf Y = \mathbf A \mathbf V\) a \((N_Q \times D_V)\) matrix
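The whole computation is a handful of matrix products. A minimal NumPy sketch, with small toy dimensions and random weights standing in for learned ones:

```python
import numpy as np

rng = np.random.default_rng(3)
N_Q, N_X, D_X, m, D_V = 2, 5, 8, 4, 6  # toy sizes

Q = rng.normal(size=(N_Q, m))          # queries (given)
X = rng.normal(size=(N_X, D_X))        # input vectors, one per row
W_K = rng.normal(size=(D_X, m)) * 0.1  # key weights (stand-in for learned)
W_V = rng.normal(size=(D_X, D_V)) * 0.1  # value weights (stand-in for learned)

K = X @ W_K                  # (N_X, m) key vectors
V = X @ W_V                  # (N_X, D_V) value vectors
E = Q @ K.T / np.sqrt(m)     # (N_Q, N_X) similarities
A = np.exp(E - E.max(axis=1, keepdims=True))
A /= A.sum(axis=1, keepdims=True)  # softmax over the rows: attention weights
Y = A @ V                    # (N_Q, D_V) output vectors
print(Y.shape)  # (2, 6)
```

Note that nothing in the computation references position: permuting the rows of \(\mathbf X\) just permutes the columns of \(\mathbf A\), and each output row is unchanged. That is exactly the "doesn't respect order" problem noted below.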
Everything is differentiable
This model is close to the seq2seq model, but it doesn’t respect order
An in between model:
Encoder that only uses attention (more or less a complicated flat neural network)
Decoder that is sequential that references the encoder via attention
This model doesn’t quite outperform the seq2seq model yet.
For that, we’ll need to add three things (next time):
Self-attention
Masked self-attention
Multiheaded self-attention